Skip to content

Add task.verify, dynamic Stripe keys, and capability warnings#3

Merged
amittell merged 7 commits into
mainfrom
feature/remaining-v02-tasks
Mar 31, 2026
Merged

Add task.verify, dynamic Stripe keys, and capability warnings#3
amittell merged 7 commits into
mainfrom
feature/remaining-v02-tasks

Conversation

@amittell

@amittell amittell commented Mar 30, 2026

Copy link
Copy Markdown
Owner

Summary

  • task.verify (closes feat: add task.verify for post-completion outcome verification #2): New verify field on tasks runs a shell command after completion to confirm the expected outcome exists. Supports on_failure: error|warn and configurable timeout.
  • Dynamic Stripe key minting: key_strategy: "dynamic" in the stripe-api-key provider now mints restricted keys via the Stripe API with automatic cleanup. Includes scope-based permission mapping and expiry management.
  • Capability warnings: validateManifestCapabilities now returns { errors, warnings }. Soft warnings surface when manifests use runtime_identity_resolution or credential_handoff (downscope) against schedulers that don't advertise those features.
  • Trust boundary docs: Architecture and execution-identity docs updated with scheduler/child trust boundary and principal separation model.

Test plan

  • 585 tests passing locally (19 verify + 45 dynamic Stripe + 6 capability + existing)
  • Lint clean
  • Verify task.verify: create manifest with verify.shell, execute, confirm pass/fail behavior
  • Verify dynamic Stripe: test with Stripe test-mode master key and restricted key minting
  • Verify capability warnings: apply manifest with provider identity to scheduler without runtime_identity_resolution

Summary by CodeRabbit

  • New Features

    • Post-task "verify" step (configurable shell, timeout, on-failure); verify outcomes now affect execution result, audit, and returned outputs.
    • Runtime downscoping with Stripe-backed dynamic restricted-key minting and best-effort revocation for child credential narrowing.
  • Documentation

    • Scheduler–child trust boundary, credential control-flow, and execution principal separation guidance; verify field documented in field reference.
  • Improvements

    • Capability validator now returns errors + advisory warnings; CLI surfaces warnings.
  • Tests

    • Expanded coverage for verify behavior and dynamic key lifecycle.

…ing v0.2 follow-ups

task.verify (agentcli#2):
- New verify field on tasks: shell command runs after completion to confirm outcome
- verify.shell (required), verify.timeout_seconds (default 30), verify.on_failure (error|warn)
- Full pipeline: validation, schema, compiler, exec (v1+v2), scheduler fields
- 19 new tests

Dynamic Stripe key minting:
- Implement key_strategy: "dynamic" in stripe-api-key provider
- Mint restricted keys via POST /v1/api_keys with master key auth
- Cleanup via DELETE /v1/api_keys/{id} (best-effort)
- Dynamic prepareHandoff mints narrower keys for child scope
- 45 new tests with mock HTTP server

Capability warnings:
- validateManifestCapabilities returns { errors, warnings } instead of flat array
- Soft warnings for runtime_identity_resolution and credential_handoff (downscope)
- Updated all callers (apply.js, cli.js, runtime adapter)
- 6 new tests

DRY dedup: already resolved (schedulerCreateSpec shared via apply.js import)
…ation

Add architecture docs explaining the credential flow control surfaces,
child credential policy enforcement model, and separation between
agentcli (control plane) and scheduler (execution runtime) roles.
Copilot AI review requested due to automatic review settings March 30, 2026 18:15
@coderabbitai

coderabbitai Bot commented Mar 30, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a task-level post-success "verify" step (schema, validation, compile, runtime), surfaces capability warnings alongside errors, implements Stripe-backed dynamic restricted API key minting/handoff with cleanup, and documents scheduler/child trust boundaries and execution/control-plane identity separation.

Changes

Cohort / File(s) Summary
Documentation
docs/architecture.md, docs/execution-identity.md, docs/field-reference.md
New trust-boundary and execution-principal sections; added workflow/task verify docs and "Task Verify Fields".
Capability validation & CLI
src/capabilities.js, src/apply.js, src/cli.js, src/runtime/openclaw-scheduler.js
validateManifestCapabilities now returns { errors, warnings }; added non-blocking warnings for missing runtime identity/credential handoff; call sites emit and surface warnings.
Verify schema & validation
src/schema.js, src/validate.js
Added reusable verify schema; allow verify on workflows/tasks; implemented validateVerify enforcing shell (required string), timeout_seconds ≥1, on_failure ∈ {error,warn}.
Verify resolution & compilation
src/compiler/shared.js, src/compiler/openclaw-scheduler.js, src/scheduler-fields.js
Exported resolveVerify(workflow, task) with precedence/defaults; compiled jobs include verify_shell, verify_timeout_s, verify_on_failure; added fields to SCHEDULER_FIELDS_V02.
Verify runtime & auditing
src/exec.js
Threaded verify into execution paths; added runVerify (sync shell exec with timeout); integrated verify.on_failure semantics, included verify results in outputs/audits, adjusted effective success and audit-on-failure, and refined cleanup conditions for provider artifacts.
Stripe dynamic key strategy
src/identity/stripe-api-key.js
Implemented dynamic key_strategy: master-key resolution, permission mapping/encoding, Stripe create/delete helpers, minting of restricted keys for materialization and handoff, session embedding of minted key, and best-effort cleanup; exported internals for tests.
Schema projection & scheduler fields
src/schema.js, src/scheduler-fields.js
Extended manifest/schema and scheduler projection to include verify and scheduler-level verify_* fields.
Tests
test/agentcli.test.js
Updated tests for new capability return shape and warnings; added extensive verify validation/compile/exec tests and comprehensive Stripe dynamic key minting/handoff/cleanup tests (includes mock HTTP server).
Minor wiring
src/apply.js, src/runtime/openclaw-scheduler.js (small changes)
Stored capability warnings and adjusted destructuring at runtime adapter call site.

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler as Scheduler
    participant Executor as Task Runner
    participant Verify as Verify Runner
    participant Audit as Audit Logger
    participant Stripe as Stripe API

    Scheduler->>Stripe: (if dynamic) POST create restricted key (mint)
    Stripe-->>Scheduler: return key id/secret/expires
    Scheduler->>Executor: dispatch compiled job (includes verify, materialization/handoff metadata, narrowed creds)
    Executor->>Executor: run main task command (spawn)
    Executor->>Executor: capture exit_code, stdout, stderr

    alt exit_code == 0 and verify present
        Executor->>Verify: runVerify(shell, timeout_seconds)
        Verify->>Verify: spawnSync('sh', ['-c', ...]) with timeout
        Verify-->>Executor: {passed, exit_code, stdout, stderr, timed_out, duration_ms}

        alt verify.passed or verify.on_failure == 'warn'
            Executor->>Audit: append verify result (warning if applicable)
        else
            Executor->>Executor: mark verify_failed, effectiveOk=false
            Executor->>Audit: emit failure audit (includes verify result)
            Executor->>Scheduler: propagate verify_failed error
        end
    else
        Executor->>Audit: log main execution result (no verify)
    end

    opt cleanup/handoff for dynamic keys
        Scheduler->>Stripe: DELETE restricted key (best-effort revoke)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through schemas, shells, and keys,

I checked the run and salted logs with ease,
I minted tiny tokens, scoped so neat,
I ran a verify, then tucked results to keep,
A rabbit cheers these changes — nibble a carrot please 🥕


Note

🎁 Summarized by CodeRabbit Free

Your organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds post-run verification for tasks, implements dynamic Stripe restricted-key minting with cleanup, and enhances capability checking to return both hard errors and soft warnings (plus updated trust-boundary documentation).

Changes:

  • Introduce task.verify / workflow.verify validation, compilation to scheduler fields, and execution-time verify behavior.
  • Implement stripe-api-key provider key_strategy: "dynamic" with Stripe API mint/revoke helpers and extensive tests.
  • Update capability validation to return { errors, warnings } and surface warnings in CLI/apply paths; expand architecture/security docs.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/agentcli.test.js Updates capability tests to new return shape and adds extensive dynamic Stripe + verify test coverage.
src/validate.js Adds verify to known keys and validates verify blocks for workflows/tasks.
src/schema.js Extends manifest schema with verify fields and adds scheduler job verify columns.
src/scheduler-fields.js Adds verify fields to v0.2 scheduler field list.
src/runtime/openclaw-scheduler.js Adapts to { errors, warnings } return shape (uses errors).
src/identity/stripe-api-key.js Implements dynamic key minting + cleanup, adds helpers/constants, exports for tests.
src/exec.js Adds verify execution phase and returns/audits verify results; verify failures can flip ok / throw.
src/compiler/shared.js Adds resolveVerify() and includes verify in normalized task plans.
src/compiler/openclaw-scheduler.js Compiles verify settings to scheduler job fields.
src/cli.js Surfaces capability warnings in --check-capabilities output.
src/capabilities.js Changes API to return { errors, warnings } and adds new warning checks.
src/apply.js Prints capability warnings to stderr during apply while still gating on errors.
docs/execution-identity.md Adds security-model section describing principal separation/trust boundary.
docs/architecture.md Documents scheduler/child trust boundary and credential strategy model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/identity/stripe-api-key.js
Comment thread src/identity/stripe-api-key.js Outdated
Comment thread src/schema.js Outdated
Comment thread src/identity/stripe-api-key.js
…ey guard

- Embed session in materialize result so cleanup() can access stripe_key_id
  without callers needing to thread session through ctx
- Restrict http:// api_base to localhost unless allow_insecure_http is set,
  preventing accidental credential leak over plain HTTP
- Add min: 1 to verify_timeout_s in published schema
- Guard resolveMasterKey against missing/non-object master_key_source
@amittell

Copy link
Copy Markdown
Owner Author

Follow-up review pass pushed in e0e1e46.

Addressed two execution-path regressions in this PR:

  • task.verify now runs in the effective task cwd/env, so shell-local env vars and v0.2 materialized credentials are visible to the verify step.
  • Dynamic Stripe sessions now revoke minted restricted keys from exec, including dry-run handoff preparation, and cleanup is awaited so warnings/results are deterministic.

Also wired task timeout into dynamic session resolution and documented the new workflow/task verify block in the field reference.

Local verification: npm test, npm run lint.

@amittell

amittell commented Mar 30, 2026

Copy link
Copy Markdown
Owner Author

Superseded by the later follow-up comments on this PR. The previous body was malformed by shell escaping; the current review summary is in the newer owner comments and review notes.

@amittell amittell left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation

  • Ran npm test locally on the PR head: all tests passed (full node --test suite including new coverage for task.verify, Stripe dynamic keys, and capability warnings).

Summary

Solid, cohesive v0.2 follow-up: post-exec verify, dynamic Stripe restricted-key minting with cleanup and HTTP hardening, capability negotiation split into hard errors vs soft warnings, plus helpful architecture / trust-boundary docs. Tests are extensive and match the behavioral contracts described in commit messages.

Strengths

  • Verify pipeline is wired end-to-end: validation, schema (min: 1 on timeout), resolveVerify merge (task overrides workflow), compiler → scheduler fields, and both v0.1 / v0.2 exec paths with correct cwd/env (including materialized identity env in v0.2).
  • Stripe dynamic: master key resolution is guarded; api_base blocks non-local http: unless opted in; mint/delete paths are tested with a mock HTTP server; materialization embeds session so cleanup() can revoke keys without threading extra context.
  • Capabilities: returning { errors, warnings } is a clean upgrade; --check-capabilities exposes structured warnings for automation.
  • Docs clearly separate agentcli control plane vs scheduler runtime — good for security reviews.

Follow-ups (non-blocking; also left inline)

  1. Evidence vs verify (v0.2): Evidence is attested before the verify shell runs, so envelopes can show success even when verify later fails. Document or adjust if evidence is meant to cover “verify included” outcomes.
  2. apply --json: Capability soft-warnings only hit stderr; consider surfacing them in the JSON payload like --check-capabilities does.
  3. Stripe: Update stale resolveSession JSDoc that still says dynamic is unimplemented.
  4. Capabilities: Handoff vs downscope warning/error asymmetry is intentional but subtle — worth a sentence in field-reference.

Overall: No test regressions found; approve-worthy from a correctness/CI perspective with the above documentation/product clarifications as polish.

Comment thread src/exec.js
Comment thread src/apply.js
Comment thread src/identity/stripe-api-key.js
Comment thread src/capabilities.js
…c, capability cross-link

- Document evidence vs verify ordering in exec.js (evidence attests command
  outcome, verify is a separate operator-local deliverable check)
- Include capability warnings in apply JSON output (capabilities.warnings)
  so automation parsing stdout gets structured warning data
- Fix stale JSDoc claiming dynamic strategy is not implemented
- Add cross-link comment explaining hard error (presentation.handoff) vs
  soft warning (child_credential_policy downscope) asymmetry
@amittell

Copy link
Copy Markdown
Owner Author

One more follow-up commit is on the PR branch: e8a5808.

I found a second-order contract issue in the new verify path: the main task respected strict sandbox/network enforcement, but the post-success verify command was still shelling out directly. That meant verify could bypass the task contract on supported local sandbox runners.

This is now fixed by running verify through the same sandbox wrapper/profile when strict enforcement is active, with a regression test covering the darwin sandboxed path.

Validation rerun after the update: npm test, npm run lint.

Copy link
Copy Markdown
Owner Author

Polish follow-up pushed in 160f856.

This pass does not change runtime behavior. It tightens the review surface around the current PR head:

  • added a regression proving applyManifestToScheduler() keeps capability warnings in the structured result while still emitting the human-readable stderr advisory
  • documented that task.verify runs after evidence generation and is recorded separately from the attested command result
  • documented why identity.presentation.handoff is a hard compatibility gate while child_credential_policy="downscope" is only a warning when credential_handoff is unavailable
  • cleaned up the malformed duplicate owner comment and resolved the remaining stale non-blocking review threads

Validation rerun on this head: npm test, npm run lint.

@amittell amittell merged commit 435ef2e into main Mar 31, 2026
2 checks passed
@amittell amittell deleted the feature/remaining-v02-tasks branch March 31, 2026 02:17

amittell commented Apr 1, 2026

Copy link
Copy Markdown
Owner Author

Post-merge assessment on PR #3 (160f8564d7e8409be3aec93c03185c06d0613260) plus current main:

  • I did not find any remaining blocking issues in the merged PR changes themselves. The task.verify, dynamic Stripe key, and capability-warning work all looks coherent after the earlier follow-up docs/tests.
  • I did find one current-main runtime gap adjacent to this PR: delegated exec handoff to openclaw-scheduler was enforcing hard capability errors but dropping soft capability warnings from the structured delegation receipt. That made exec inconsistent with apply --check-capabilities / apply for the same manifest shape.
  • I also found that examples/ansible-ops.json had drifted out of schema (contract.audit: "full" instead of none | on-failure | always), which was breaking the repo’s all example manifests validate and compile test on clean main.

I pushed a small follow-up branch with both fixes:

Validation on that branch:

  • npm test: 592/592 passing
  • npm run lint: passing

The new regression test covers delegated exec returning warnings / capability_warnings when the scheduler lacks runtime_identity_resolution, and the example repair restores the green full-suite baseline on main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add task.verify for post-completion outcome verification

2 participants